Download Latin.unicharset along with radical-stroke.txt #219
All unicharset files for scripts are potentially needed, starting with … I usually fetch only the ones required to satisfy the error message(s), but I still don't know what happens if they are missing.
I added only the Latin and Inherited unicharsets to this list because they are required in almost all cases, even though a missing copy does not stop processing the way a missing radical-stroke.txt does. We could add another optional variable, SCRIPT_UNICHARSET, and download that file when the variable is non-blank.
I think some characters, e.g. Arabic accents, get dropped from the unicharset generated by unicharset_extractor. That was the reason I built Inherited.unicharset.
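The optional SCRIPT_UNICHARSET variable suggested above could look roughly like the sketch below. It is an illustration, not the code from this PR: it assumes SCRIPT_UNICHARSET holds a script name such as Arabic rather than a file name, and LANGDATA_URL, UNICHARSET_FILES and the unicharsets target are hypothetical names. Only DATA_DIR and the base URL come from the diff under review.

```makefile
# Hypothetical optional variable: a script name such as Arabic.
# Left empty, no extra unicharset is downloaded.
SCRIPT_UNICHARSET ?=

# Base URL copied from the wget line in this PR.
LANGDATA_URL = https://github.com/tesseract-ocr/langdata_lstm/raw/master

# Latin is always fetched; the script unicharset only when requested.
UNICHARSET_FILES = $(DATA_DIR)/Latin.unicharset
ifneq ($(strip $(SCRIPT_UNICHARSET)),)
UNICHARSET_FILES += $(DATA_DIR)/$(SCRIPT_UNICHARSET).unicharset
endif

# Pattern rule: download any <Script>.unicharset into DATA_DIR.
# $* expands to the script name matched by %.
$(DATA_DIR)/%.unicharset:
	mkdir -p $(@D)
	wget -O $@ '$(LANGDATA_URL)/$*.unicharset'

# Hypothetical convenience target that fetches everything listed above.
.PHONY: unicharsets
unicharsets: $(UNICHARSET_FILES)
```

With this in place, something like `make unicharsets SCRIPT_UNICHARSET=Arabic` would fetch both files, while a plain `make unicharsets` would fetch only Latin.unicharset. Note that `wget -O` creates the output file even when the download fails, so a failed run may leave an empty file that make treats as up to date.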
Makefile
@@ -303,6 +303,8 @@ $(OUTPUT_DIR).traineddata: $(LAST_CHECKPOINT)
endif

$(DATA_DIR)/radical-stroke.txt:
	# wget -O $(DATA_DIR)/Inherited.unicharset 'https://github.com/tesseract-ocr/langdata_lstm/raw/master/Inherited.unicharset'
	wget -O $(DATA_DIR)/Latin.unicharset 'https://github.com/tesseract-ocr/langdata_lstm/raw/master/Latin.unicharset'
I'd put that in a separate Makefile target.
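A separate target along these lines might be what is meant here. This is only a sketch under assumptions: LANGDATA_URL and the langdata-files aggregate target are hypothetical names, and the existing radical-stroke.txt rule is assumed to keep its own recipe unchanged. The point is that a file downloaded as a side effect of the radical-stroke.txt rule is not re-fetched if it goes missing while radical-stroke.txt still exists, whereas a file with its own rule is.

```makefile
# Base URL copied from the wget line in the diff above.
LANGDATA_URL = https://github.com/tesseract-ocr/langdata_lstm/raw/master

# Separate target: make runs this recipe whenever Latin.unicharset is
# missing, independently of whether radical-stroke.txt already exists.
$(DATA_DIR)/Latin.unicharset:
	mkdir -p $(@D)
	wget -O $@ '$(LANGDATA_URL)/Latin.unicharset'

# Hypothetical aggregate target that requests both downloads.
# The rule for $(DATA_DIR)/radical-stroke.txt is the existing one.
.PHONY: langdata-files
langdata-files: $(DATA_DIR)/radical-stroke.txt $(DATA_DIR)/Latin.unicharset
```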
Inherited.unicharset is NOT in the langdata_lstm repo. I created it by copying the lines with Inherited from other unicharsets. But the coordinates for the same character differ between unicharsets, so I am not sure which ones should be used.
Hi, how can I get Inherited.unicharset?
A list of all required
Thanks for the suggestions, @stweil, and for the hint to get the list of required unicharsets from $(OUTPUT_DIR)/unicharset. I am having a hard time using that list in a separate Makefile target. I would appreciate it if you could make the required change. Here is what I have tried so far:
@kba Could you please have a look at the change request and maybe come up with a proposal?
I added … A simpler way may be to ask the user to specify a script and download that.
I have tried that in the new Makefile-font2model.
Included as part of #230.
Need another PR to add Inherited.unicharset after tesseract-ocr/langdata_lstm#41 is merged.